CESCProg: a compact prognostic model and nomogram for cervical cancer based on miRNA biomarkers

Cervical squamous cell carcinoma, more commonly cervical cancer, is the fourth common cancer among women worldwide with substantial burden of disease, and less-invasive, reliable and effective methods for its prognosis are necessary today. Micro-RNAs are increasingly recognized as viable alternative biomarkers for direct diagnosis and prognosis of disease conditions, including various cancers. In this work, we addressed the problem of systematically developing an miRNA-based nomogram for the reliable prognosis of cervical cancer. Towards this, we preprocessed public-domain miRNA -omics data from cervical cancer patients, and applied a cascade of filters in the following sequence: (i) differential expression criteria with respect to controls; (ii) significance with univariate survival analysis; (iii) passage through dimensionality reduction algorithms; and (iv) stepwise backward selection with multivariate Cox modeling. This workflow yielded a compact prognostic DEmiR signature of three miRNAs, namely hsa-miR-625-5p, hs-miR-95-3p, and hsa-miR-330-3p, which were used to construct a risk-score model for the classification of cervical cancer patients into high-risk and low-risk groups. The risk-score model was subjected to evaluation on an unseen test dataset, yielding a one-year AUROC of 0.84 and five-year AUROC of 0.71. The model was validated on an out-of-domain, external dataset yielding significantly worse prognosis for high-risk patients. The risk-score was combined with significant features of the clinical profile to establish a predictive prognostic nomogram. Both the miRNA-based risk score model and the integrated nomogram are freely available for academic and not-for-profit use at CESCProg, a web-app (https://apalania.shinyapps.io/cescprog).


INTRODUCTION
Cervical cancer (cervical squamous cell carcinoma; CESC) ranks fourth globally among cancers in women, and second among women of reproductive age.Due to unequal implementation of invasive screening techniques, the morbidity and mortality rate of cervical cancer continues to rise in countries like India, where it accounted for 9.4% of all cancers and 18.3% of new cases in 2020 (Sung et al., 2021).Multiple etiological

MATERIALS AND METHODS
Portions of this text were previously published as a preprint (Muthamilselvan & Palaniappan, 2023).The workflow is summarised in Fig. 1, and discussed in detail below.R version '3.6.3'(http://www.r-project.org) was used in the study.

Data preprocessing
Normalized and log 2 -transformed Illumina HiSeq miRSeq data preprocessed with the TCGA miRNA analysis pipeline were obtained from firebrowse.org portal.The patient barcode of each sample was parsed to annotate the samples as 'normal' and 'cancer'.The corresponding clinical metadata was also retrieved from firebrowse.org (CESC.Merge_Clinical.Level_1.2016012800.0.0.tar) and used to annotate the stage information (encoded in 'patient.stage_event.clinical_stage' variable) of the tumor samples, and then merged with the expression data.The clinical stage is essentially the surgical stage prior to any treatment received, from the biopsy obtained at the time of surgery.Collapsing possible substages (A, B, C) in each stage yielded the four-class macro-progression of stages (I, II, III, IV).Certain demographic and clinical factors in the metadata including age, HPV status, smoking history, pregnancies, histologic grade and vital status were retained.The survival data for each sample is derived based on the attributes 'vital status' and 'days_to_follow_up' in the clinical metadata.Based on the merged dataset, the following dataset pruning was done: (i) samples without stage information were removed; and (ii) miRNAs with negligible change in expression across samples (defined as standard deviation of miRNA expression across samples, σ <1) were also removed.

Linear modelling
The miRNA expression analyses of cancer stages relative to the normal tissue (controls) were performed using the limma version '3.42.2' package in R (Ritchie et al., 2015).The workflow was essentially adapted from protocols developed in our lab (Sarathi & Palaniappan, 2019).
To recapitulate, a linear model of gene expression was fit using controls as intercept and sample stages as indicator variables, according to the following relationship: The fit model was adjusted with empirical Bayes to obtain moderated tstatistics (McCarthy & Smyth, 2009).Multiple hypothesis testing and the false discovery rate were applied using the method of Benjamini and Hochberg to yield adjusted p-values of the F-statistic of the linear fit (Hochberg & Benjamini, 1990).Based on the fold change (FC) in the expression of individual miRNAs across conditions, miRNAs with |log 2 (FC)|>1.5 and adj.p-value <0.05 were considered significantly differentially expressed miRNAs (DEmiRs).The preprocessed dataset was then split into train and test datasets in the ratio 0.8:0.2.The test dataset was kept invisible to the model development process and used only for the performance evaluation of the final model.

Development of compact miRNA signature
Univariate Cox models (Clark et al., 2003) were used to screen the DEmiRs by significance, and only DEmiRs with p-value <0.05 were filtered for further analysis.Two robust feature selection methods, namely Least absolute shrinkage and selector operation (LASSO) Cox regression (Tibshirani, 1997) and Support vector machine -recursive feature elimination (SVM-RFE) (Adorada et al., 2018), were used in combination to reduce the dimensionality of the prognostic DEmiRs.LASSO, a form of 'penalized' regression with L1 penalty, was implemented using R-glmnet version '4.1' (Friedman, Hastie & Tibshirani, 2010), whereas SVM-RFE, which computes ranking weights for all features and then iteratively performs backward selection, was implemented using R-e1071 version '1.7.4' (Dimitriadou et al., 2009).A union of the features selected from these two implementations was taken forward and used in a stepwise multivariate Cox regression (Bradburn et al., 2003) for establishing the prognostic DEmiR signature of cervical cancer.

Prognostic risk model
A risk model was formulated based on the identified prognostic DEmiR signature and used to evaluate the survival risk of each patient.The risk with respect to a unit increase in a given biomarker level is given by exponential of the miRNA coefficient in the following multivariate Cox model: where n is the size of the prognostic DEmiR signature, miRNA i denotes the expression level of the ith miRNA, and β i denotes the effect-size (or weight) of the ith miRNA.
Applying the optimal cut-point (i.e, median) given by maxstat (maximally selected rank) statistic from the R-survminer version '0.4.8' (Kassambara, Kosinski & Biecek, 2020) to the risk score distribution, we categorized (binarized) patients with CESC into high-risk and low-risk groups.Kaplan-Meier curves and AUROC were used to analyze the overall survival (OS) probabilities between high-risk and low-risk groups using R survival version '3.2.7' (Therneau & Grambsch, 2000) and R survival ROC version '1.0.3' (Heagerty, Lumley & Pepe, 2000), respectively.The test dataset and an additional external dataset for blind validation were used to evaluate the prognostic value of the developed model.

Nomogram construction
Since miRNA-based risk score was unlikely to be the only prognostic predictor for overall survival, the clinical profile was also considered.Both univariate and multivariate Cox regression analyses were performed with some clinical features, namely age, pregnancies, smoking_history, grade, stage, and HPV_status.Only the variables that stayed significant in both the models were selected, and used with the miRNA-based risk score to build an integrated risk model and then a nomogram map that tabulates the probabilities of one-year and five-year OS of CESC.The discrimination was quantified using Harrell's concordance index (C-index), and calibration performed using bootstrap with 1,000 resamples.

RESULTS
The TCGA expression data consisted of expression values of 2,589 miRNA in 312 samples enrolled in this study, including 309 cervical cancer tissues and three matched normal tissues.The dataset was pruned in accordance with the pre-processing steps outlined in the Methods section, yielding an expression dataset consisting of 467 miRNAs across 303 samples with stage annotation.The distribution of samples according to the AJCC staging system (Amin et al., 2017) is shown in Table 1.The demographic features and clinical characteristics considered, namely age, smoking history, vital status, pregnancies, HPV status, and histologic grade are summarized in Table 2. Fitting the linear model and applying the filter criteria yielded a total of 101 differentially expressed miRNAs between cervical cancer tissues and matched normal tissues, provided in File S1.Most of the topranked miRNAs are overexpressed (e.g., hsa-miR-200c-3p, hsa-miR-141, hsa-miR-200a, hsa-miR-21-5p), suggesting oncomir function with increased epigenetic suppression of target tumor-suppressor gene expression.Table 3 shows the top ten miRNAs with their stage-wise log 2 FC and linear model significance.
Each DEmiR was subjected to univariate Cox modeling to evaluate its prognostic significance.This process identified only 52 miRNAs as significantly associated with overall survival, based on p-value <0.05 (data presented in File S2).To optimize the dimensions of the prognostic miRNA biomarker panel, we applied Lasso-penalized Cox regression on the 52 miRNAs to obtain five miRNAs, hsa-miR-625-5p, hsa-miR-3934-5p, hsa-miR-330-3p, hsa-miR-642a-5p, has-miR-95-3p.Only one miRNA, hsa-miR-616-5p, survived the SVM-RFE feature selection process.Figure 2 shows the union of these results (i.e, the six miRNAs).All the prognostic DEmiRS post feature selection processes were upregulated, but none were an outlier DEmiR.The six miRNAs were taken forward for multivariate survival analysis, and subjected to a stepwise backward-selection process, to further compact the miRNA signature.This process yielded an optimal signature of three miRNAs namely hsa-miR-625-5p, hsa-miR-330-3p, and hsa-miR-95-3p, with model p-value <0.002 (Table 4), for construction of the CESC prognostic risk model.The CESC prognostic risk model, given by Eq. ( 2), was then parameterized using the expression of these three miRNAs: miRNA_Risk_score = 0.30 * hsa-miR-95-3p + 0.35 * hsa-miR-330-3p − 0.52 * hsa-miR-625-5p. (3) It is seen that hsa-miR-625-5p has a significant protective effect on CESC OS, whereas the expression of hsa-miR-95-3p and hsa-miR-330-3p elevate the risk.Based on this model, we computed the risk score for each patient in the train dataset, and used the maxstat of the resulting risk-score distribution to separate patients into high-and low-risk groups (Fig. 3A).The Kaplan-Meier survival curve of this distribution revealed significantly worse prognosis in the high-risk group (p-value <1E-4) (Fig. 3B).Time-dependent ROC analysis of the risk-score model on the train dataset for 1-, 2-, 3-, and 5-year overall survival yielded prognostic AUC values of 0.71, 0.72, 0.74 and 0.73, respectively (Fig. 3C).These results encouraged validation of the CESC-related prognostic signature on the test dataset, whose risk-score distribution is shown in Fig. 4A.The following outcomes validated the results: (i) Kaplan-Meier survival curve showed significantly worse prognosis in the high-risk group ( p-value <1E-4) (Fig. 4B) ; and (ii) time-dependent AUROC values 0.84, 0.79, 0.71 and 0.71 were obtained for 1-, 2-, 3-, and 5-year overall survival, respectively (Fig. 4C).
To validate the prognostic value of the model on an external, out-of-domain dataset, we used the study results of How et al. (2015).This study used a TaqMan Low Density Array (TLDA) to measure expression in formalin-fixed paraffin-embedded (FFPE) cervix samples.Two datasets from the study were used: (i) Normalized and log 2 -transformed miRNA expression data of 87 FFPE cervix samples used for validation, available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4399941 and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4399941.The clinical information was used to annotate the samples, and the expression subset corresponding to the three miRNAs in the optimal risk model (i.e., (3)) was extracted.Since the miRNA arm information(-3p or -5p) was missing for hsa-miR-625 and hsa-miR-95, the arm-neutral expression values for both these miRNAs were used.The risk score for each sample was calculated based on (3), and the resulting risk score distribution was stratified into high-risk and low-risk patient groups based on the maxstat statistic computed by R survminer.The curves were visualized using Kaplan-Meier analysis, yielding significantly worse prognosis (P = 0.032) in the high-risk patient group relative to the low-risk group (Fig. 5).
Certain clinical features namely age, HPV_status, pregnancies, smoking_history, histologic_grade, and stage could boost the prognostic predictive value, and hence were examined for candidate inclusion in the risk model.Each clinical feature was subjected to univariate Cox survival analysis.Surprisingly, neither the patient's HPV-status nor tumor grade was significant to CESC OS, and moreover only one clinical feature turned This 'Stage' constitutes a risk factor independent of the miRNA-based risk-score, and was used with the miRNA_risk_score to construct an integrated multivariate Cox regression model.As a result, both factor levels of both the variables were significant, and the overall multivariate model was extremely significant (p-value ∼4E-05) (Fig. 6).The integrated CESC prognostic risk model was then parameterized as: Based on the risk models developed, a nomogram was built to predict one-year and five-year survival probabilities (Fig. 7).The nomogram C-index of the integrated model was estimated as 0.7136 ± 0.047, compared with a C-index of 0.6399 ± 0.044 for the miRNAbased risk model alone, thus differentiating the models on the basis of discrimination.Furthermore, the nomogram calibration plots of the integrated model for one-year and five-year OS probabilities based on bootstrap resampling showed consistency between the predicted and actual survival probabilities (Fig. 8).

DISCUSSION
MiRNAs add a layer of critical regulatory control over genomic expression, and aberrations in their expression could lead to the development of cancer hallmarks (Iorio & Croce, 2012).MiRNAs could be detected in the serum, and lend valuable potential as diagnostic and prognostic biomarkers of various cancers, including cervical cancer (Pisarska & Baldy-Chudzik, 2020).Several prognostic miRNAs for CESC have been reported, including miR-31 (Wang et al., 2014), miR-155 (Fang et al., 2016), and miR425-5p (Sun et al., 2017).However a systematic hypothesis-free scan for comprehensive miRNA signatures remains missing in the literature.In this study, we have attempted to fill this void with an integrated multi-layered bioinformatics approach to the detection of a reliable prognostic DEmiR biomarker signature.The study has yielded three prognostic miRNAs, namely hsa-miR-625-5p, hsa-miR-95-3p, and hsa-miR-330-3p.Downregulation of hsa-miR-625-5p has been documented in many cancers including non-small cell lung cancer (Dao et al., 2020), hepatocellular carcinoma (Zhou et al., 2015), melanoma (Zou et al., 2019) and cervical cancer (Wang et al., 2020).A causal mechanism relating miR-625-5p expression to inhibition of cervical cancer cell growth via suppression of NF-κB signaling has been reported (Li et al., 2020), consistent with its mirsupp identity disclosed here.Sponging miR-625-5p in turn is likely to drive cervical cancer progression, and this has been demonstrated recently (Li et al., 2023).Jafarzadeh et al. suggested that miR-330-3p promoted protumorigenic events in various cancers like those of the lung, pancreas, bladder, and cervix, and that its downregulation could stall tumor development (Jafarzadeh et al., 2022), both observations consistent with its oncomir identity disclosed here.Further, miR-95-3p has been implicated in activating the wnt/ βcatenin pathway in prostate cancer tissues (Xi et al., 2019), thereby promoting cell proliferation, migration and invasion, consistent with its oncomir identity disclosed here.
To examine the network-level effects of these miRNAs, we retrieved the RNA-Seq transcriptome for each patient in our dataset from firebrowse.org, and correlated this data with the expression of the three miRNAs of interest to infer potential target genes.Target genes with substantial inverse correlation in expression (defined as Pearson ρ or Spearman ρ or Kendall τ <−0.3) were identified, and the consensus with multiMiR version '1.8.0' (Ru et al., 2014) predictions for each of the three miRNAs was investigated.This yielded three consensus target genes with respect to hsa-miR-95-3p, namely NXPH3, BOC, EID1; two consensus target genes with respect to hsa-miR-625-5p, namely SIN3B and TPRG1L; and two consensus target genes with respect to hsa-miR-330-3p, namely THRA and DYRK2.Functional enrichment analysis of the consensus genes conducted with miRNeT (Chang et al., 2020) on GO and KEGG databases yielded significance for cancer pathways and cell cycle regulation.We also used the miR2Trait server (Babu & Palaniappan, 2022) to investigate the diseasome of this three-miRNA signature, and found significance for 'uterine cervical neoplasm' (p-value ∼1.5E-3), 'squamous cell carcinoma' (p-value ∼7.7E-3), and 'cervical Though all stages of cervical cancer were sufficiently represented in the dataset we have used, it is noted that the dataset was imbalanced with respect to controls.To investigate if the observations in our study generalize when more controls were included in the analysis, we obtained an miRNA-based GEO (Edgar, Domrachev & Lash, 2002) study GSE104758 (Snoek et al., 2019), containing 32 controls and 24 carcinoma in situ CIN3 samples, and concerned with identifying biomarkers for detecting precancerous CIN3 lesions.The Sequence Read Archive of the 32 controls was imported into Galaxy (The Galaxy Community, 2022), checked for FastQC quality, and trimmed for adapters.The reads were then mapped with the reference human genome hg_38 to yield genome-wide miRNA quantification data.This dataset was batch-corrected using ComBat from SVA version '3.35.2' (Leek et al., 2012), and used to augment the normals in the CESC TCGA cohort, yielding 35 normal and 300 cancer samples.The corresponding expression matrix was then modelled according to (1) in limma, as outlined in the Methods section.Using an adj.p-val threshold ∼0.05 yielded a total of 544 DE miRNAs, including all the 101 original DEmiRs and specifically the three miRNAs of the prognostic miRNA_Risk_score model derived above.The detailed results of the linear model obtained with this controls-augmented dataset are provided in File S4.Additional supporting evidence for our results were then obtained as follows: (i) A contrast between the normals and CIN3 samples of GSE104758 yielded 64 DEmiRs with p-value <0.05.This was compared with the nine miRNAs in ref. 44 obtained by applying group-regularised ridge regression.Except hsa-let-7b, the remaining eight miRNAs were common to both the analyses, viz.hsa-miR-9-5b, hsa-miR-15b-3p, hsa-miR-20a-5p, hsa-miR-31-5p, hsa-miR-93-5p, hsa-miR-183-5p, hsa-miR-184, and hsa-miR-222-3p.The GSE104758 study further narrowed down to a biomarker signature of just five miRNAs(viz.hsa-miR-93-5p, hsa-miR-15b-3p, and hsa-miR-222-3p, and hsa-let-7b) via multivariate logistic regression with backward selection.Application of SVM-RFE to the 64 DEmiRs obtained from our contrast analysis identified 24 DEmiRs, including all the miRNAs in the GSE104758 biomarker panel except hsa-let-7b.It is notable that the process also yielded hsa-miR-330-3p, ranked nine among the 24 selected features.(ii) Univariate Cox analysis with the batch-corrected cancer samples in the augmented dataset yielded 241 significant miRNAs, including hsa-miR-95-3p and hsa-miR-625-5p.
Nomograms are widely used for simplifying the task of interpretation from models, and have been constructed with miRNAs for cervical cancer screening (Kotani et al., 2022), prognosis (Liu et al., 2020), and recurrence risk (Bogani et al., 2019).To facilitate the ready prognosis of cervical cancer patients, the models developed in this work were re-built with the full (train + test) dataset, and served as a web-app named CESCProg, deployed at: https://apalania.shinyapps.io/cescprog/ for non-commercial uses.The concerned user may provide the form inputs, namely the expression values of the three prognostic DEmiRs and an optional sample staging information.Based on the user request, the app proceeds to classify the risk of the sample, and compute a risk-score based on (2) or (4).The calculated risk-score is then consulted with the back-end nomogram to estimate the one-year and five-year survival probabilities.Serum-based or cervical mucus-based miRNAs are minimally invasive, and could be detected and quantified using a range of techniques (e.g., see ref. Baabu et al., 2022).

CONCLUSIONS
MiRNA biomarkers are an emerging diagnostic and prognostic aid to the management of disease, especially cancers.Here we present CESCProg, an miRNA-based prognostic model for cervical cancer developed by applying a sequence of purifying filters to the TCGA CESC dataset.All three miRNAs in the panel, namely hsa-miR-95-3p, hsa-miR-330-3p and hsa-miR-625-5p, show upregulation in cervical cancer relative to controls, suggesting feasibility for detection as biomarkers.In the miRNA risk model, hsa-miR-625-5p exhibits a protective effect on OS, while the other two miRNAs elevate the risk.The miRNA risk model was effective and extremely significant in stratifying CESC OS on the test dataset.A second risk model was developed with the inclusion of clinical features to maximize nomogram discrimination.This yielded a C-index of 0.7136 ± 0.047.The models have been deployed as a web-service as a possible aid to medical decision-making.They are available for non-profit use at: https://apalania.shinyapps.io/cescprog.

Figure 1
Figure 1 The workflow used in this study for the development of a compact validated risk model for cervical cancer prognosis.The predictive prognostic nomogram was re-built with the full dataset prior to deployment at CESC-PROG (https://apalania.shinyapps.io/cescprog).Full-size DOI: 10.7717/peerj.15912/fig-1

Figure 2
Figure 2 Volcano plot of the expression distribution of the miRNAs with non-trivial expression patterns in cancer samples relative to controls, highlighting the upregulated and downregulated DEmiRs, and the prognostic DEmiRs post the feature selection process.Any potential outliers would be located in the top-right space.X -axis denotes log 2 (FC) of expression with respect to control, and the Y -axis denotes the −log 10 transformation of the p-value significance of the linear model for the respective miRNA.Full-size DOI: 10.7717/peerj.15912/fig-2

Figure 3
Figure 3 Performance of the constructed risk-score model on train dataset.(A) This panel shows the risk-score value (top), survival status (middle), and expression of the three prognostic miRNAs (bottom) for each patient, sorted by the risk-score distribution.Patients were stratified into low-risk (blue) and high-risk (red) groups according to the risk-score value.The patterns in the expression profiles accord with the signed risk of the respective miRNAs.(B) Kaplan-Meier survival curves based on the three-miRNA prognostic signature showing significant difference between the two groups.(C) Time-dependent ROC curves for 1-, 2-, 3-, and 5-year overall survival predictions using the given model.Full-size DOI: 10.7717/peerj.15912/fig-3

Figure 4
Figure 4 Performance evaluation of the constructed risk-score model on unseen test dataset.(A) This panel shows the risk-score value (top), survival status (middle), and expression of the three prognostic miRNAs (bottom) for each patient, sorted by the risk-score distribution.Patients were stratified into lowrisk (blue) and high-risk (red) groups according to the median risk-score value.(B) Kaplan-Meier survival curves based on the three-miRNA prognostic signature showing significant difference between the two groups.(C) Time-dependent ROC curves for 1-, 2-, 3-, and 5-year overall survival predictions using the given model.Full-size DOI: 10.7717/peerj.15912/fig-4

Figure 5
Figure 5 Kaplan-Meier survival curves for the validation dataset, showing significantly worse prognosis for the high-risk patient group relative to the low-risk group.95% confidence bands for the risk groups are also shown.Full-size DOI: 10.7717/peerj.15912/fig-5

Figure 8
Figure 8 Nomogram calibration curves.(A) 1-year OS probability; (B) 5-year OS probability.The four sub-cohorts of the dataset are visualized, and the corresponding x represents the bootstrap-corrected estimates of the nomogram performance along with the standard error.The solid line compares the nomogram performance with the reference truth.Full-size DOI: 10.7717/peerj.15912/fig-8

Table 1 Distribution of cases by stage
. AJCC staging is represented by the TNM (Tumor-Node-Metastasis) code.Control refers to matched normal samples, and 'NA' denotes cases with unavailable stage information.

Table 2 Clinical profile of cervical cancer patients
. Summary of key clinical/demographic features of the dataset.For ordinal/continuous variables (age, smoking_history, and pregnancies), the mean ± standard deviation is given.Histologic grade refers to the degree of differentiation in the cancer sample.It is seen that most cervical cancer patients present with HPV+ status.

Table 3 Top 10 miRNAs of the linear model.
The log-fold change expression of the miRNA in each stage relative to the controls is given, followed by p-value adjusted for multiple hypothesis testing.

Table 4 Summary of the Cox survival analysis.
It is seen that hsa-miR-625-5p has a significant protective effect on CESC OS, in contrast with hsa-miR-95-3p and hsa-miR-330-3p.The overall multivariate model is very significant with p-value < 0.002.HR denotes hazard rate, and CI confidence interval.